In this exercise, we will require the tidyverse, janitor, and gt packages. We will explore a data file and introduce a few data manipulation options for data cleaning.

library(tidyverse)
library(janitor)
library(gt)

The goal of this exercise is to explore and consider how to clean a messy dataset. The Metropolitan Museum of Art in New York City maintains a database of more than 470,000 artworks. For the purposes of this exercise, we are going to focus on a small sample of objects in a file which requires some data cleaning.

This exercise is structured to encourage you to perform and explore each cleaning step separately and then combine them at the end. In practice, you are welcome to add each step as you go.

(a) Read in the data set.

The file is called MetUnclean.csv and you can read it in using the read_csv function. Choose `met_unclean` as the name for the data frame if you want to be consistent with the rest of the exercise and the solutions we provide.

met_unclean <- read_csv("MetUnclean.csv")
Rows: 11 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Department, Object Title, Artist_Name, Artist_Nationality, Medium
dbl (2): Artist_Birth_Year, Object_Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
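
The message above suggests two follow-ups: `spec()` to see the guessed column types, and `show_col_types = FALSE` to silence the report. A minimal sketch of both, using a small made-up inline CSV rather than the real MetUnclean.csv so that it runs on its own (the `I()` wrapper for literal data assumes readr 2.0 or later):

```r
library(readr)

# Hypothetical two-row CSV standing in for MetUnclean.csv (values are illustrative)
toy_csv <- I("Object Title,Object_Age\nA Hare and Birds,389\nUntitled,9999\n")

# show_col_types = FALSE suppresses the column specification message
toy <- read_csv(toy_csv, show_col_types = FALSE)

# spec() reports the column types that read_csv guessed
spec(toy)
```

With the real file, the same arguments would be passed alongside the file name in place of the inline text.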

(b) Have a look at the data file and see if you can identify any ways in which it might benefit from cleaning.

One of the object titles doesn’t seem quite right. What do you think might have happened?

While it is not possible to know this with the information available, this object had a title in Japanese script that could not be properly rendered in our file format.

(c) Use glimpse() or the RStudio data view to see the names of the variables in the data file. One variable name includes a space. Write some code to remove this space, to make it easier to refer to that variable.

A variable name that contains a space must be wrapped in backticks whenever it is referred to, so removing the space now makes later code easier to write.

We will use the rename() function to do this. You might like to replace the space with an underscore.

Remember to save the result as a new data frame with a different name from the original data, e.g. met_clean.

met_clean <- met_unclean %>%
  rename(Object_Title = `Object Title`)
glimpse(met_clean)
Rows: 11
Columns: 7
$ Department         <chr> "Drawings and Prints", "European Paintings", "Europ…
$ Object_Title       <chr> "Petrarch's Laura", "A Hare and Birds", "Alexander …
$ Artist_Name        <chr> "Enea Vico", "Jan Fyt", "Pietro Testa", "Thomas Wij…
$ Artist_Nationality <chr> "Italian", "Flemish", "Italian", "Dutch", "French",…
$ Artist_Birth_Year  <dbl> 1523, 1611, 1612, 1616, 1741, 1838, 1886, 1894, 190…
$ Object_Age         <dbl> 477, 389, 375, 404, 234, 132, 97, 106, 62, 9999, 33
$ Medium             <chr> "Engraving", "Oil on canvas", "Oil on canvaSS", "Et…

Note that we could also use the clean_names function in the janitor package to change all the variable names in a systematic way (e.g. remove all spaces) for an entire data frame.
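
As a rough illustration of the clean_names alternative, here is a sketch on a made-up one-row data frame (the data frame `toy` is hypothetical; only its column names matter):

```r
library(tibble)
library(janitor)

# Hypothetical toy data frame with one name containing a space
toy <- tibble(`Object Title` = "A Hare and Birds", Artist_Name = "Jan Fyt")

# clean_names() converts every column name to snake_case by default
toy_clean <- clean_names(toy)
names(toy_clean)
```

Note that the default also lowercases the names, so the columns become `object_title` and `artist_name`. This differs from the `Object_Title` spelling used in the rest of this exercise.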

(d) Produce a frequency table of the Medium variable. Can you see the error? Write some code to correct it.

Produce a frequency table using tabyl to find the error. Use the replace function within the mutate function to correct it, then produce the table again to check the result.

met_unclean %>%
  tabyl(Medium) %>%
  adorn_pct_formatting() %>%
  gt()
Medium                         n  percent
Engraving                      1  9.1%
Etching                        1  9.1%
Graphite, ink, and watercolor  1  9.1%
Oil on canvaSS                 1  9.1%
Oil on canvas                  2  18.2%
Pewter                         1  9.1%
Polychrome woodblock print     1  9.1%
Red chalk                      1  9.1%
Silver dye bleach print        1  9.1%
leather                        1  9.1%

met_clean <- met_unclean %>%
  mutate(Medium = replace(Medium, Medium == "Oil on canvaSS", "Oil on canvas"))
met_clean %>%
  tabyl(Medium) %>%
  adorn_pct_formatting() %>%
  gt()
Medium                         n  percent
Engraving                      1  9.1%
Etching                        1  9.1%
Graphite, ink, and watercolor  1  9.1%
Oil on canvas                  3  27.3%
Pewter                         1  9.1%
Polychrome woodblock print     1  9.1%
Red chalk                      1  9.1%
Silver dye bleach print        1  9.1%
leather                        1  9.1%

It is always worth checking errors with the source, to ensure the correction is appropriate.

The advantage of this approach for correction is that it is well-documented, reproducible and easy to amend.
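
To see what replace does inside mutate, it can help to try it on a plain vector first. A minimal sketch, using a made-up vector `medium` rather than the real data:

```r
# Hypothetical vector of media, including the misspelt entry
medium <- c("Engraving", "Oil on canvas", "Oil on canvaSS")

# replace(x, index, value): the logical test picks out the entries to overwrite
medium_fixed <- replace(medium, medium == "Oil on canvaSS", "Oil on canvas")

# Tabulate the corrected values to confirm the misspelling is gone
table(medium_fixed)
```

Because the test matches the misspelt string exactly, every other entry is left unchanged.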

(e) Produce a summary of the object age variable. Can you see that some missing data has been coded unhelpfully? Write some code to change these missing values into the R missing value code.

Use the summarise function to look at the Object_Age variable, as it is numeric. Use na_if within the mutate function to ensure the missing data point is stored in a more useful form.

met_unclean %>%
  summarise(Mean = mean(Object_Age), 
            SD = sd(Object_Age), 
            Min = min(Object_Age),
            Med = median(Object_Age),
            Max = max(Object_Age),
            n = n()) %>%
  gt()
Mean      SD        Min  Med  Max   n
1118.909  2949.388  33   234  9999  11

met_clean <- met_unclean %>%
  mutate(Object_Age = na_if(Object_Age, 9999))
met_clean %>%
  summarise(Mean = mean(Object_Age, na.rm = TRUE),
            SD = sd(Object_Age, na.rm = TRUE),
            Min = min(Object_Age, na.rm = TRUE),
            Med = median(Object_Age, na.rm = TRUE),
            Max = max(Object_Age, na.rm = TRUE),
            n = n()) %>%
  gt()
Mean   SD        Min  Med  Max  n
230.9  165.7645  33   183  477  11
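
The effect of na_if, and the reason na.rm is then needed in the summary, can be seen on a short vector. A minimal sketch with made-up values (the vector `age` is hypothetical):

```r
library(dplyr)

# Hypothetical ages, with 9999 used as a missing-value code
age <- c(477, 9999, 33)

# na_if() returns NA wherever the value equals 9999 and leaves the rest alone
age_clean <- na_if(age, 9999)

mean(age_clean)                # NA, because one value is missing
mean(age_clean, na.rm = TRUE)  # 255, the mean of 477 and 33
```

Note that n = n() in the summary above counts rows, so it still reports 11 even though only 10 ages are non-missing.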

(f) Put all of your code together and create a new, clean data set with all the changes you have made.

This should be one pipeline that starts with the met_unclean data frame and produces a clean data frame called met_clean.

met_clean <-
  met_unclean %>%
    rename(Object_Title = `Object Title`) %>%
    mutate(Medium = replace(Medium, Medium == "Oil on canvaSS", "Oil on canvas"),
           Object_Age = na_if(Object_Age, 9999))

glimpse(met_clean)
Rows: 11
Columns: 7
$ Department         <chr> "Drawings and Prints", "European Paintings", "Europ…
$ Object_Title       <chr> "Petrarch's Laura", "A Hare and Birds", "Alexander …
$ Artist_Name        <chr> "Enea Vico", "Jan Fyt", "Pietro Testa", "Thomas Wij…
$ Artist_Nationality <chr> "Italian", "Flemish", "Italian", "Dutch", "French",…
$ Artist_Birth_Year  <dbl> 1523, 1611, 1612, 1616, 1741, 1838, 1886, 1894, 190…
$ Object_Age         <dbl> 477, 389, 375, 404, 234, 132, 97, 106, 62, NA, 33
$ Medium             <chr> "Engraving", "Oil on canvas", "Oil on canvas", "Etc…

Remember, if you don’t assign all your hard work to an object, it won’t be saved anywhere.
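
One way to confirm that the assigned result really contains every change is to add a few checks after the pipeline. The sketch below runs the same three steps on a made-up three-row data frame (`toy_unclean` is hypothetical, not the real data) and uses stopifnot, which raises an error if any check fails:

```r
library(dplyr)
library(tibble)

# Hypothetical three-row stand-in for met_unclean (not the real data)
toy_unclean <- tibble(
  `Object Title` = c("A Hare and Birds", "Untitled", "Study"),
  Object_Age     = c(389, 9999, 62),
  Medium         = c("Oil on canvas", "Oil on canvaSS", "Red chalk")
)

# Same three cleaning steps as above, assigned to a new object so the result is kept
toy_clean <- toy_unclean %>%
  rename(Object_Title = `Object Title`) %>%
  mutate(Medium = replace(Medium, Medium == "Oil on canvaSS", "Oil on canvas"),
         Object_Age = na_if(Object_Age, 9999))

# Simple checks that each step did what was intended
stopifnot(
  "Object_Title" %in% names(toy_clean),
  !"Oil on canvaSS" %in% toy_clean$Medium,
  !9999 %in% toy_clean$Object_Age
)
```

If the pipeline were run without the assignment, the cleaned data would be printed but `toy_clean` would not exist, and the checks would fail with an "object not found" error.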

(g) Extension exercises

Download the airport screening file used in lectures and attempt to perform your own cleaning exercise.


© 2023 Statistical Consulting Centre, The University of Melbourne.